A statistical study of the WPT05 crawl of the Portuguese Web

Authors

  • David Batista
  • Mário J. Silva
Abstract

This article presents a statistical study of WPT05, a text corpus derived from a crawl of the Portuguese Web performed in 2005. The corpus is a valuable resource for researchers in Natural Language Processing (NLP) and one of the biggest publicly available collections of European Portuguese texts. We provide statistical analyses of its contents, covering the languages identified, the representativeness of the top-level domains crawled, and term frequency and size. We also present an analysis of an n-gram collection extracted from the Portuguese documents in the corpus, and we analyze the occurrence of first names, surnames and geographic names. Since some toponyms are named after personal names, we show the overlap of Portuguese names with geographic entities corresponding to places in Portugal.
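
The kinds of analysis mentioned in the abstract (n-gram counts and the overlap between personal names and toponyms) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the whitespace tokenizer, the function names `ngrams`, `ngram_counts` and `name_toponym_overlap`, and the toy data are hypothetical and are not taken from the WPT05 study, which may well have used different tokenization and matching rules.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every contiguous run of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def ngram_counts(documents: Iterable[str], n: int) -> Counter:
    """Count n-grams over lowercased, whitespace-split documents.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    counts: Counter = Counter()
    for text in documents:
        counts.update(ngrams(text.lower().split(), n))
    return counts


def name_toponym_overlap(names: set[str], toponyms: set[str]) -> set[str]:
    """Return the case-insensitive intersection of two name lists."""
    return {n.lower() for n in names} & {t.lower() for t in toponyms}


# Toy data, invented for this example only.
docs = ["Santiago vive em Lisboa", "Maria vive em Lisboa"]
bigrams = ngram_counts(docs, 2)
shared = name_toponym_overlap({"Santiago", "Mário"}, {"Santiago", "Lisboa"})
```

With this toy input, `bigrams.most_common()` ranks the bigrams by frequency and `shared` holds the names that also appear in the toponym list. A corpus-scale version would need streaming input and a proper tokenizer.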

Similar resources

An Updated Portrait of the Portuguese Web

This study presents an updated characterization of the Portuguese Web derived from a crawl of 48 million contents belonging to all media types (2.5 TB of data), performed in March, 2008. The resulting data was analyzed to characterize contents, sites and domains. This study was performed within the scope of the Portuguese Web Archive.

Introducing the Portuguese web archive initiative

This paper introduces the Portuguese Web Archive initiative, presenting its main objectives and work in progress. Term search over web archive collections is a desirable feature that raises new challenges. The paper discusses how the term index size could be reduced without significantly decreasing the quality of search results. The results obtained from the first performed crawl show that the Po...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...

The Effects of the Learning Model, Skilled Model, and Positive Self-review on the Learning of Front Crawl Swimming in Children

One of the main goals of motor learning experts is to maximize the quality of learning experiences and optimize the educational environment. The purpose of this study was to examine the effects of the learning model, skilled model and positive self-review on the learning of front crawl swimming in children aged 9 to 11 years in Alborz Province. Participants from the random and available samples were divided into d...

Collecting Statistics about the Portuguese Web

This report presents a characterization of text documents from the Portuguese Web. This characterization was produced from a crawl of over 4 million URLs and 131 thousand sites in 2003. We describe rules that we established for defining its boundaries and the methodology used to gather statistics. We also show how crawling constraints and abnormal situations on the Web can influence the results.

Publication date: 2010